Overview

This study explores possible correlations between repository and event parameters in order to determine a sampling methodology for GitHub repositories based on public GitHub event activity.

GitHub is huge. It hosts over 10 million repositories, and that population is highly variable, which makes analysis of the whole population challenging.

The solution explored in this study uses activity data collected by the GitHub Archive project, which maintains a regularly updated archive of the public event stream available through the GitHub API. The data are available as a public data set on Google BigQuery.

Event data allow for time-based stratification of GitHub repositories. For the purposes of this research, only recently active repositories are of interest; however, even research exploring the lifespan of repositories could benefit from this stratification.
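To make the idea of time-based stratification concrete, the following is a minimal sketch and not the method used in this study. It assumes events have already been reduced to (repository name, timestamp) pairs. The function name `stratify_by_recency` and the 30-day and 180-day boundaries are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def stratify_by_recency(events, now, boundaries_days=(30, 180)):
    """Group repositories into strata by days since their latest event.

    events: iterable of (repo_name, timestamp) pairs (assumed input shape).
    boundaries_days: ascending, illustrative cut-offs; not from the study.
    Returns a dict mapping stratum index (0 = most recent) to repo names.
    """
    # Keep only the most recent event timestamp seen for each repository.
    last_seen = {}
    for repo, ts in events:
        if repo not in last_seen or ts > last_seen[repo]:
            last_seen[repo] = ts

    strata = {}
    for repo, ts in last_seen.items():
        age_days = (now - ts).days
        # Stratum index = number of boundaries the age has reached.
        stratum = sum(age_days >= b for b in boundaries_days)
        strata.setdefault(stratum, []).append(repo)
    return strata
```

Under these assumptions, stratum 0 holds the recently active repositories that this research focuses on, while the higher strata hold repositories whose latest public event is progressively older.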

This study is part of a larger research project to answer the following questions: